AITopics | inference acceleration

Search for Efficient Large Language Models

Neural Information Processing Systems | Mar-22-2026, 22:20:10 GMT

Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research. Numerous efficient techniques, including weight pruning, quantization, and distillation, have been embraced to compress LLMs, targeting memory reduction and inference acceleration, which underscore the redundancy in LLMs. However, most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures. Besides, traditional architecture search methods, limited by the elevated complexity with extensive parameters, struggle to demonstrate their effectiveness on LLMs. In this paper, we propose a training-free architecture search framework to identify optimal subnets that preserve the fundamental strengths of the original LLMs while achieving inference acceleration. Furthermore, after generating subnets that inherit specific weights from the original LLMs, we introduce a reformation algorithm that utilizes the omitted weights to rectify the inherited weights with a small amount of calibration data. Compared with SOTA training-free structured pruning works that can generate smaller networks, our method demonstrates superior performance across standard benchmarks. Furthermore, our generated subnets can directly reduce the usage of GPU memory and achieve inference acceleration.

artificial intelligence, large language model, natural language, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Inference Acceleration of Autoregressive Normalizing Flows by Selective Jacobi Decoding

Zhang, Jiaru, Lu, Juanwu, Wang, Ziran, Zhang, Ruqi

arXiv.org Artificial Intelligence | Jun-2-2025

Normalizing flows are promising generative models with advantages such as theoretical rigor, analytical log-likelihood computation, and end-to-end training. However, the architectural constraints to ensure invertibility and tractable Jacobian computation limit their expressive power and practical usability. Recent advancements utilize autoregressive modeling, significantly enhancing expressive power and generation quality. However, such sequential modeling inherently restricts parallel computation during inference, leading to slow generation that impedes practical deployment. In this paper, we first identify that strict sequential dependency in inference is unnecessary to generate high-quality samples. We observe that patches in sequential modeling can also be approximated without strictly conditioning on all preceding patches. Moreover, the models tend to exhibit low dependency redundancy in the initial layer and higher redundancy in subsequent layers. Leveraging these observations, we propose a selective Jacobi decoding (SeJD) strategy that accelerates autoregressive inference through parallel iterative optimization. Theoretical analyses demonstrate the method's superlinear convergence rate and guarantee that the number of iterations required is no greater than the original sequential approach. Empirical evaluations across multiple datasets validate the generality and effectiveness of our acceleration technique. Experiments demonstrate substantial speed improvements up to 4.7 times faster inference while keeping the generation quality and fidelity.
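The parallel iterative idea described in this abstract can be illustrated with a plain Jacobi fixed-point iteration. The sketch below is not the paper's SeJD method: it uses a hypothetical toy affine autoregressive map (the matrices `A` and `B` and the `tanh` log-scale are assumptions made only for illustration) and shows that updating every position in parallel from the previous iterate reaches the sequential result in at most `n` sweeps.

```python
import numpy as np

def sequential_decode(z, A, B):
    """Invert a toy autoregressive affine map one position at a time."""
    n = len(z)
    x = np.zeros(n)
    for i in range(n):
        # Position i depends only on the already decoded prefix x[:i].
        shift = A[i, :i] @ x[:i]
        log_scale = np.tanh(B[i, :i] @ x[:i])
        x[i] = z[i] * np.exp(log_scale) + shift
    return x

def jacobi_decode(z, A, B, tol=1e-10):
    """Update all positions in parallel from the previous iterate."""
    n = len(z)
    # Strictly lower-triangular weights encode the autoregressive dependency.
    lower_a = np.tril(A, k=-1)
    lower_b = np.tril(B, k=-1)
    x = np.zeros(n)
    for sweep in range(1, n + 1):
        x_new = z * np.exp(np.tanh(lower_b @ x)) + lower_a @ x
        if np.max(np.abs(x_new - x)) < tol:
            return x_new, sweep
        x = x_new
    # After n sweeps every position has seen an exact prefix.
    return x, n
```

After sweep k the first k positions are exact, which is why the iteration count never exceeds the sequence length; any saving comes from converging earlier than that.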

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2505.24791

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Search for Efficient Large Language Models

Neural Information Processing Systems | May-27-2025, 21:50:10 GMT

inference acceleration, language model, llm, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

A Theoretical Perspective for Speculative Decoding Algorithm

Yin, Ming, Chen, Minshuo, Huang, Kaixuan, Wang, Mengdi

arXiv.org Machine Learning | Oct-29-2024

Transformer-based autoregressive sampling is the major bottleneck that slows down large language model inference. One effective way to accelerate inference is Speculative Decoding, which employs a small model to sample a sequence of draft tokens and a large model to validate them. Despite its empirical effectiveness, the theoretical understanding of Speculative Decoding lags behind. This paper addresses this gap by conceptualizing the decoding problem via a Markov chain abstraction and studying its key properties, output quality and inference acceleration, from a theoretical perspective. Our analysis covers the theoretical limits of speculative decoding, batch algorithms, and output quality-inference acceleration tradeoffs. Our results reveal the fundamental connections between different components of LLMs via total variation distances and show how they jointly affect the efficiency of decoding algorithms.
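To make the link between draft acceptance and total variation distance concrete, here is a minimal sketch of the standard single-token accept/reject rule from the speculative decoding literature. The abstract does not spell out the paper's exact formulation, so treat this as an assumption; the function names and the toy distributions `p` (target) and `q` (draft) are hypothetical.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def acceptance_probability(p, q):
    """Chance that a token drawn from the draft q is accepted under target p."""
    return np.minimum(p, q).sum()

def speculative_step(p, q, rng):
    """Draw one token from q, then accept it or resample so the output follows p."""
    x = rng.choice(len(q), p=q)
    # Accept with probability min(1, p(x) / q(x)); q[x] > 0 because x came from q.
    if rng.random() < min(1.0, p[x] / q[x]):
        return int(x), True
    # On rejection, resample from the normalized positive part of p - q.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False
```

Under this rule the acceptance probability equals one minus the total variation distance between `p` and `q`, so a draft model closer to the target accepts more tokens while the emitted tokens remain distributed according to the target.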

arxiv preprint arxiv, large language model, machine learning, (17 more...)

arXiv.org Machine Learning

2411.00841

Country:

South America > Brazil (0.04)
Asia > Middle East > Qatar (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Parallel Speculative Decoding with Adaptive Draft Length

Liu, Tianyu, Li, Yun, Lv, Qitan, Liu, Kai, Zhu, Jianchen, Hu, Winston

arXiv.org Artificial Intelligence | Sep-4-2024

Speculative decoding (SD), where an extra draft model is employed to provide multiple draft tokens first and then the original target model verifies these tokens in parallel, has shown great power for LLM inference acceleration. However, existing SD methods suffer from the mutual waiting problem, i.e., the target model gets stuck when the draft model is guessing tokens, and vice versa. This problem is directly incurred by the asynchronous execution of the draft model and the target model, and is exacerbated due to the fixed draft length in speculative decoding. To address these challenges, we propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL). Specifically, PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase. PEARL parallels the drafting phase and the verification phase via applying the two strategies, and achieves adaptive draft length for different scenarios, which effectively alleviates the mutual waiting problem. Moreover, we theoretically demonstrate that the mean accepted tokens of PEARL is more than existing draft-then-verify works. Experiments on various text generation benchmarks demonstrate the effectiveness of our PEARL, leading to a superior speedup performance up to 3.79× and 1.52×, compared to auto-regressive decoding and vanilla speculative decoding, respectively.
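For orientation, the baseline that this abstract calls draft-then-verify with a fixed draft length can be sketched as follows. This is not PEARL: pre-verify and post-verify are not implemented, the models are stand-in callables that return the next token for a context (hypothetical names), and the verification loop only emulates what a real system would do in one batched target forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]

def draft_then_verify_greedy(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[int], int]:
    """Greedy draft-then-verify decoding with a fixed draft length."""
    tokens = list(prompt)
    verify_rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # Drafting phase: the small model proposes draft_len tokens.
        draft: List[int] = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # Verification phase: counted as one target pass per round.
        verify_rounds += 1
        accepted: List[int] = []
        for i in range(draft_len):
            expected = target_next(tokens + draft[:i])
            if expected != draft[i]:
                # First mismatch: keep the target's token and stop.
                accepted.append(expected)
                break
            accepted.append(draft[i])
        else:
            # Every draft token matched, so the target adds one more token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], verify_rounds
```

In this sequential form the target model idles while the draft model runs and vice versa, which is the mutual waiting the abstract refers to.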

draft model, draft token, pearl, (13 more...)

arXiv.org Artificial Intelligence

2408.1185

Country:

North America > United States (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Inference acceleration for large language models using "stairs" assisted greedy generation

Grigaliūnas, Domas, Lukoševičius, Mantas

arXiv.org Artificial Intelligence | Jul-29-2024

Large Language Models (LLMs) with billions of parameters are known for their impressive predictive capabilities but require substantial resources to run. Given their massive rise in popularity, even a small reduction in required resources could have an impact on the environment. Smaller models, on the other hand, require fewer resources but may sacrifice accuracy. In this work, we propose an implementation of "stairs" assisted greedy generation. It is a modified assisted generation methodology that makes use of a smaller model's fast generation, a large model's batch prediction, and "stairs" validation to speed up prediction generation. Results show an inference time reduction of between 9.58 and 17.24 percent compared to stand-alone large LLM prediction in a text generation task, without a loss in accuracy.
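The result above is reported as a percentage reduction in inference time, whereas other entries on this page report speedup factors. A small helper, with a hypothetical name and assuming speedup is defined as old time divided by new time, converts between the two:

```python
def speedup_from_time_reduction(percent_reduction: float) -> float:
    """Convert a percent reduction in run time into a speedup factor."""
    if not 0.0 <= percent_reduction < 100.0:
        raise ValueError("percent_reduction must be in [0, 100)")
    # New time is (1 - r) times the old time, so speedup is 1 / (1 - r).
    return 1.0 / (1.0 - percent_reduction / 100.0)

for reduction in (9.58, 17.24):
    print(f"{reduction}% less time -> {speedup_from_time_reduction(reduction):.3f}x")
```

Under that definition, the reported 9.58 to 17.24 percent range corresponds to roughly 1.11x to 1.21x.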

experiment, greedy generation, prediction, (16 more...)

arXiv.org Artificial Intelligence

2407.19947

Country:

Europe > Lithuania > Kaunas County > Kaunas (0.05)
North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report (0.84)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Inference Acceleration for Large Language Models on CPUs

PS, Ditto, VG, Jithin, MS, Adarsh

arXiv.org Artificial Intelligence | Mar-4-2024

In recent years, large language models have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, deploying these models for real-world applications often requires efficient inference solutions to handle the computational demands. In this paper, we explore the utilization of CPUs for accelerating the inference of large language models. Specifically, we introduce a parallelized approach to enhance throughput by 1) exploiting the parallel processing capabilities of modern CPU architectures and 2) batching the inference requests. Our evaluation shows that the accelerated inference engine gives an 18-22x improvement in generated tokens per second. The improvement is greater with longer sequences and larger models. In addition, we can run multiple workers on the same machine with NUMA node isolation to further improve tokens/s. As shown in Table 2, we obtained an additional 4x improvement with 4 workers. This would also make Gen-AI based products and companies more environmentally friendly: our estimates show that using CPUs for inference could reduce the power consumption of LLMs by 48.9% while providing production-ready throughput and latency.

gen intel xeon scalable processor, intel xeon scalable processor, utilization, (11 more...)

arXiv.org Artificial Intelligence

2406.07553

Genre: Research Report (0.43)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Distributed Deep Learning Inference Acceleration using Seamless Collaboration in Edge Computing

Li, Nan, Iosifidis, Alexandros, Zhang, Qi

arXiv.org Artificial Intelligence | Jul-22-2022

This paper studies inference acceleration using distributed convolutional neural networks (CNNs) in collaborative edge computing. To ensure inference accuracy in inference task partitioning, we consider the receptive field when performing segment-based partitioning. To maximize the parallelization between the communication and computing processes, thereby minimizing the total inference time of an inference task, we design a novel task collaboration scheme, named HALP, in which the overlapping zone of the sub-tasks on secondary edge servers (ESs) is executed on the host ES. We further extend HALP to the scenario of multiple tasks. Experimental results show that HALP can accelerate CNN inference in VGG-16 by 1.7-2.0x for a single task and 1.7-1.8x for 4 tasks per batch on GTX 1080TI and JETSON AGX Xavier, which outperforms the state-of-the-art work MoDNN. Moreover, we evaluate the service reliability under a time-variant channel, which shows that HALP is an effective solution to ensure high service reliability under strict service deadlines.
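The reason segment-based partitioning produces an overlapping zone can be seen from generic receptive-field arithmetic. The sketch below is not HALP's scheme: it is a minimal, assumed helper (function names and the `(kernel, stride, padding)` layer tuples are hypothetical) that computes which input rows a slice of output rows depends on through a stack of convolution or pooling layers.

```python
from typing import List, Tuple

Layer = Tuple[int, int, int]  # (kernel, stride, padding) along the height axis

def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def input_rows_needed(
    out_start: int, out_end: int, layers: List[Layer], in_height: int
) -> Tuple[int, int]:
    """Inclusive input row range required to compute output rows out_start..out_end."""
    # Forward pass over sizes so each layer's input height is known.
    heights = [in_height]
    for kernel, stride, padding in layers:
        heights.append(conv_out_size(heights[-1], kernel, stride, padding))
    lo, hi = out_start, out_end
    # Walk backwards, widening the range by each layer's receptive field.
    for (kernel, stride, padding), h_in in zip(reversed(layers), reversed(heights[:-1])):
        lo = max(lo * stride - padding, 0)
        hi = min(hi * stride - padding + kernel - 1, h_in - 1)
    return lo, hi
```

For two 3x3, stride-1, padding-1 layers on an 8-row input, splitting the output into rows 0-3 and 4-7 requires input rows 0-5 and 2-7 respectively; rows 2-5 are needed by both halves, which is the kind of overlap a partitioning scheme has to assign to some server.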

artificial intelligence, inference time, machine learning, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICC45855.2022.9839083.

2207.11294

Country: Europe (0.04)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.84)

Xilinx Selects Mipsology Zebra Software to Accelerate Alveo U50 FPGA – IAM Network

#artificialintelligence | Jun-28-2020, 16:50:34 GMT

AI software innovator Mipsology today announced that its Zebra neural network accelerating software has been integrated into the latest build of Xilinx's Alveo U50 data center accelerator card, the industry's first low profile adaptable accelerator with PCIe Gen 4 support. Zebra's ease-of-use and high throughput enable the Alveo U50 to compute convolutional neural networks with zero effort. This is the latest in a series of Zebra-enhanced Xilinx boards that enable inference acceleration for a wide variety of sophisticated AI applications. "The level of acceleration that Zebra brings to our Alveo cards puts CPU and GPU accelerators to shame," said Ramine Roane, Xilinx's Vice President of marketing. "Combined with Zebra, Alveo U50 meets the flexibility and performance needs of AI workloads and offers high throughput and low latency performance advantages to any deployment."

artificial intelligence, machine learning, xilinx select mipsology zebra software, (6 more...)

#artificialintelligence

Country: North America > United States > California (0.08)

Industry: Semiconductors & Electronics (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.42)

Amazon Elastic Inference adds support for PyTorch machine learning models - SiliconANGLE

#artificialintelligence | Apr-14-2020, 23:03:46 GMT

Amazon Web Services Inc. announced today that it's adding support for PyTorch models with its Amazon Elastic Inference service, which it said will help developers reduce the costs of deep learning inference by as much as 75% in some cases. Amazon Elastic Inference is a service launched in late 2018 that enables customers to attach graphics processing unit-powered inference acceleration to a standard Amazon EC2 instance. Inference refers to the process of making predictions using a trained deep learning model. PyTorch is an open-source machine learning library that was first developed by Facebook Inc. It's used primarily for applications such as computer vision and natural language processing.

deep learning model, inference acceleration, siliconangle, (11 more...)

#artificialintelligence

Country: North America > United States > California (0.06)

Genre: Press Release (1.00)

Industry: Information Technology > Services (0.57)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
